Network Model-Assisted Inference from 
Respondent-Driven Sampling Data 



August 2, 2011 

Abstract 

Respondent-Driven Sampling is a method to sample hard-to-reach 
human populations by link-tracing over their social networks. Be- 
ginning with a convenience sample, each person sampled is given a 
small number of uniquely identified coupons to distribute to other 
members of the target population, making them eligible for enroll- 
ment in the study. This can be an effective means to collect large 
diverse samples from many populations. 

Inference from such data requires specialized techniques for two 
reasons. Unlike in standard sampling designs, the sampling process 
is both partially beyond the control of the researcher, and partially 
implicitly defined. Therefore, it is not generally possible to directly 
compute the sampling weights necessary for traditional design-based 
inference. Any likelihood-based inference requires the modeling of 
the complex sampling process often beginning with a convenience 
sample. We introduce a model-assisted approach, resulting in a design- 
based estimator leveraging a working model for the structure of the 
population over which sampling is conducted. 

We demonstrate that the new estimator has improved performance 
compared to existing estimators and is able to adjust for the bias in- 
duced by the selection of the initial sample. We present sensitivity 
analyses for unknown population sizes and the misspecification of 
the working network model. We develop a bootstrap procedure to 
compute measures of uncertainty. We apply the method to the es- 
timation of HIV prevalence in a population of injecting drug users 
(IDU) in the Ukraine, and show how it can be extended to include 
application-specific information. 



Keywords: Hard-to-reach population sampling; Link-tracing; Network 
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1 Introduction 

There is much interest in estimating features of hard-to-reach human pop- 
ulations. Such populations are characterized by the lack of a serviceable 
population sampling frame. In some settings, the target population is 
well-connected by a network of social relations. Link-tracing sampling 
strategies such as snowball sampling (Goodman, 1961) and respondent-driven 
sampling (RDS) (Heckathorn, 1997) are often used to leverage those social 
relations to sample beyond the small subgroup available to researchers. 
In these settings, subsequent samples are identified and selected based on 
their social ties with other members of the target population. The statis- 
tical literature dealing with such strategies (Frank, 1971; Goodman, 1961; 
Thompson, 1990; Thompson and Frank, 2000), typically assumes an ide- 
alized setting in which the initial sample is assumed to be a probability 
sample from the target population. The applied literature on the other 
hand Trow (1957); Watters and Biernacki (1989), has traditionally recog- 
nized that this is impractical, and therefore treated link-tracing samples 
(typically referred to as snowball samples, despite Goodman's probabilis- 
tic framing) as convenience samples for which probability-based inferen- 
tial methods are unfounded. 

The work of Heckathorn and colleagues (Heckathorn, 1997, 2007; Sal- 
ganik and Heckathorn, 2004; Volz and Heckathorn, 2008) around the RDS 
specialization of link-tracing sampling is innovative in reducing the num- 
ber of links followed per respondent, such that many waves of sampling 
are fostered, decreasing the dependence of the final sample on the initial 
convenience sample. The second main innovation of the RDS paradigm 
is in the respondent-driven nature of the sampling process in which sub- 
sequent samples are selected by the passing of coupons by current sam- 
ple members, thus reducing the confidentiality concerns often present in 
hard-to-reach marginalized populations. While this approach does reduce 
the dependence of the final sample on the initial sample, it is possible for 
substantial bias to remain based on the initial sample of seeds, as studied 
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in simulations by Gile and Handcock (2010) and illustrated empirically 
by Johnston (2010). Current estimation methods (Gile, 2011; Heckathorn, 
1997, 2007; Salganik and Heckathorn, 2004; Volz and Heckathorn, 2008), 
however, do not correct for biases introduced by seed selection. A com- 
mon feature of networked populations is that the social ties are often more 
likely to occur between people that have similar attributes than those who 
do not, a tendency called homophily by attributes (Freeman, 1996; Lazars- 
feld and Merton, 1954; McPherson, Smith-Lovin and Cook, 2001). In this 
paper we present a novel approach and inferential frame to correct for 
bias introduced by seed selection and for the effects of homophily. In par- 
ticular, we treat the problem of estimation of the population proportion 
of a binary covariate in populations where there exists homophily on the 
covariate of interest, based on a branching link-tracing sample beginning 
with seeds selected with bias with respect to that covariate. 

There is a varied formal statistical literature on inference from link- 
tracing network samples. All of this work, however, involves the assump- 
tion that the initial sample is a probability sample drawn from a well- 
defined sampling frame, and that subsequent sampling is adaptive, or de- 
pendent on population characteristics only through their observed por- 
tions (Thompson and Seber, 1996). In the design-based framework, these 
works consider cases where sampling probabilities are known for all units 
in the analysis (Frank, 1971, 2005; Goodman, 1961; Thompson, 1992, 2006). 
Inference is then made without reference to any superpopulation model. 
In the likelihood frame, the literature treats cases where the adaptive sam- 
pling process is amenable to the model, and therefore the modeling can 
be conducted without explicit treatment of the sampling process (Hand- 
cock and Gile, 2010; Pattison, Robins, Daraganova, Wang, Koskinen and 
Snijders, 2009; Thompson and Frank, 2000). The traditional approach to 
RDS, originally due to Heckathorn (1997), represents an alternative to this 
paradigm. The assumption of the original probability sample is replaced 
by an assumption of sufficient waves of sampling to adequately reduce 
the dependence of the sample on the original sample. 

In this paper, we concern ourselves with a case in which none of these 
approaches suffice. The sampling probabilities of the units are not known, 
making the traditional design-based approaches inadequate. The initial 
sample is not a probability sample, so the sample is not adaptive or amenable, 
and any likelihood inference must consider the sampling process as well 
as the population model. Such a joint modeling approach has been con- 
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ducted in a few works (Felix-Medina and Monjardin, 2006; Felix-Medina 
and Thompson, 2004; Frank and Snijders, 1994), but each of these requires 
an initial probability sample from some frame to allow for modeling of the 
sampling process. And while in some cases, the waves of sampling may 
be sufficient to suitably reduce the dependence on the initial sample, this 
is often not the case (Gile and Handcock, 2010), and we are interested in 
the cases when this does not hold. 

We begin in Section 2 by introducing respondent-driven sampling. In 
Section 3, we then present our Model- Assisted inferential approach. Sec- 
tion 4 presents a simulation study illustrating the removal of bias intro- 
duced by the initial convenience sample. Our application to HIV preva- 
lence estimation among injecting drug users in the Ukraine can be found 
in Section 5, and Section 6 presents a discussion and concluding remarks. 

2 Respondent-Driven Sampling 
2.1 Notation 

We assume the target population consists of N people (nodes) with labels 
1, . . . , N. Let the iV-vector z, represent a binary nodal outcome variable of 
interest. We refer to this variable as "infection status", such that 



We assume the target population is connected by a network of mutual 
relations with N x N adjacency matrix y: 



and that this network forms a single connected component. Denote by 
dj = J2j yij tne nodal degree, or number of network ties or alters of node i. 
Let d = {d 1; . . . , d N }. Denote by Xj = J2j z jYij the number of network ties 
node i shares with infected nodes, and let x = {x 1; . . . , xjy}. 

2.2 Sampling Procedure 

We consider an RDS procedure of the following form: 




i not infected 

1 % infected. 




1 i and j connected 

% and j not connected, 
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0. An small initial sample is selected from the population members 
accessible to researchers, typically using a convenience mechanism. 
They are typically 3-12 in number. They are called the seeds and com- 
prise wave k = of the sample. 

1. Each member of wave k is given a small number (typically 2-3) of 
uniquely identified coupons to distribute among their alters. 

2. Coupon recipients returning their coupons to the study center are 
subsequently enrolled in the study. A person recruited in a prior 
wave can not be recruited again. The wave number of a respondent 
is one more than that of their recruiter. 

3. Steps (1) and (2) are repeated until the desired sample size is at- 



This process has proved effective at recruiting large and diverse sam- 
ples from many hard-to-reach populations (Abdul-Quader, Heckathorn, 
McKnight, Bramson, Nemeth, Sabin, Gallagher and Jarlais, 2006), and has 
been widely used. It has been heavily used in the monitoring of disease 
prevalence and risk behaviors among high-risk populations such as sex 
workers, men who have sex with men, and injecting drug users (Malekine- 
jad, Johnston, Kendall, Kerr, Rifkin and Rutherford, 2008), largely in the 
service of the reporting requirements of UNAIDS for all countries with 
concentrated HIV epidemics (UNAIDS, 2008). It is also used by the US 
Centers for Disease Control and Prevention in the behavioral monitor- 
ing of injecting drug users in 25 large US cities (Lansky, Abdul-Quader, 
Cribbin, Hall, Finlayson, Garfein, Lin and Sullivan, 2007), and has also 
been used in other populations such as unregulated workers (Bernhardt, 
Heckathorn, Milkman and Theodore, 2006) and jazz musicians (Heckathorn 
and Jeffri, 2001). 

We represent the full sampling mechanism by the random variables: 



tained. 




person i is sampled in wave k 
otherwise 



i e 1 . . . N, k e 0, . . . . 




oo 



1 person % is sampled 
person % is not sampled 



% e 1 . . . N, 



k=0 
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and let s k denote the observed sampling vector corresponding to the peo- 
ple sampled in wave k. Based on the sampling procedure, we exactly ob- 
serve the elements of z, d and x corresponding to i : Sj = 1. A variant when 
x cannot be observed directly, as in the application in Section 5, substitutes 
an estimate of x based on observed referral patterns. 

Further we assume each respondent distributes a number of coupons 
completely at random from among their alters, with the number deter- 
mined by a common distribution. 



2.3 Design-based Inferential Approach 

We consider design-based estimators for the population mean /i = ^2f =1 Zj. 
Because the sampling probabilities of the people selected through RDS 
are almost never explicitly known, we follow Volz and Heckathorn (2008), 
and Gile (2011) in constructing a model for the sampling process, and esti- 
mating sampling probabilities accordingly. We use a generalized Horvitz- 
Thompson estimator of the form: 

V- = t . (!) 

2^j=l 7T; 

where estimated sampling probabilities 7Tj = E(Sj|^) are computed under 
an approximation y to the true RDS sampling process. If the inclusion 
probabilities were known this estimator is referred to as the Hajek esti- 
mator (Lumley, 2010), and typically performs better than the correspond- 
ing Horvitz-Thompson estimator (Sarndal, Swensson and Wretman, 1992). 
The estimators introduced by Volz and Heckathorn (2008) and Gile (2011) 
differ, and ours further differs, in their specification of the sampling pro- 
cess y. 

Most inference from RDS data approximates the sampling process as a 
with-replacement random walk on the space of graph nodes, with transi- 
tions along the edges or social relations. For the purpose of inference, sam- 
pling is treated as a Markov chain at equilibrium (Salganik and Heckathorn, 
2004; Volz and Heckathorn, 2008). Such inference involves sampling weights 
proportional to the self -reported degrees which are the equilibrium sam- 
pling probabilities of the with-replacement random walk on a connected 
network. While this is a useful first approximation, it has several limita- 
tions. First, as highlighted in Gile (2011), this type of inference does not 
respect the without-replacement nature of the sampling process, which 
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can lead to biased estimates. Gile (2011) presents an approach correcting 
for this feature by substituting a without-replacement successive sampling 
approximation to the sampling process. Neither this, nor earlier estima- 
tors, however, address the fundamental issue of bias induced by the se- 
lection of the initial sample. Such bias is illustrated in Gile and Handcock 
(2010), as well as in the current paper, and correction for it is a key contri- 
bution of the present paper. 

As with these earlier approaches, the first requirement of our sampling 
model is that it account for the different sampling probabilities by nodal 
degree. Unlike these other approaches, we further require our approach 
to account for the bias introduced by the selection of seeds in the presence 
of network homophily in the underlying population. This requires con- 
sideration of features of the social network, y, in particular the homophily 
of the relations. 

We make no assumptions about the mechanism for selecting the ini- 
tial sample and will condition on the seed characteristics throughout the 
analysis. 

If the network y were fully known, we could use simulation to estimate 
the sampling probability = E(Si\y, y, s°) of each node, conditional on 
the selection of seeds, s°. Explicitly, we would repeatedly simulate RDS 
under sampling model y starting from s° each time and compute the 7Tj y 
as the proportion of simulated samples containing node i. These could be 
used in (1) to form an estimator. Unfortunately, y is typically only partially 
known, and so we apply a model-assisted approach. 

3 A Model- Assisted Approach 

Our approach is an extension of the model-assisted design-based approaches 
presented in Sarndal et al. (1992). Existing work in this area uses a work- 
ing model form to construct estimators that are (approximately) design- 
unbiased, whether the model holds or not, and have smaller design vari- 
ance if the model does hold. Our case is slightly different. The sam- 
pling process we consider is only locally defined, and originates at a sam- 
ple with unknown distribution. We therefore cannot guarantee design- 
unbiasedness. In fact, we require reference to a model form to recover 
approximate design-unbiasedness, rather than to improve efficiency. This 
is because the impact of the seed characteristics on the subsequent sample 
is mediated by the structure of the underlying social network. 
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Our approach is to assume a working superpopulation model from 
which the network was drawn and use it to estimate sampling probabili- 
ties conditional on the selection of the initial sample. 



3.1 Network Working Model 

We consider models of exponential-family random graph model (ERGM) form 
(Frank and Strauss, 1986; Hunter, Goodreau and Handcock, 2008; Hunter 
and Handcock, 2006), conditional on the set of nodal degrees and infec- 
tion statuses, and including a single additional parameter representing 
homophily on z. In particular: 

P(Y = yM ,,)=*», (2, 

c(77|z,d) 

where g(y,z) = £fj=i y^z^l-z^), and c(r]\z, d) = £ ue ^ (Z)d) exp(^(u,z)), 
and the space ^(z, d) consists of all binary undirected networks consistent 
with d and z (the dependence on d and z is suppressed below). Note that 
this model form, as well as the simulation procedure to follow, requires 
knowledge of the population size N. 

Given this model form, we use the estimator (1) based on sampling 
weights assumed constant over equivalence classes by degree and infec- 
tion status and estimated under the model: 

7r^ = E(S l | t r,z,d,r ? ,s )= ^ y P(Y = y|z,d,r ? ), 

ye^(z,d) 

where n^y = E(S i \S fi , y, s°), as defined in Section 2.3. Note that to treat 
these equivalence classes, we condition on the equivalence classes of the 
seed nodes selected, rather than the unique identities of those nodes. 

We also do not know the network working model parameter r], and 
therefore must estimate it from the available data. The estimator is then 
computed using sampling probabilities based on the estimated network 
working model given by f): 

7r^ = E(S^,z,d,r),s°)= ^ 7r i>y P(Y = y|z, d, fj) (3) 

ye^(z,d) 

These are the estimated probabilities used in our proposed estimator. 
This requires fitting a network working model to data sampled through 
RDS, which we address in the next section. 
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3.2 Fitting the Network Working Model 



Thompson and Frank (2000) and Handcock and Gile (2010) provide an 
approach to fitting models of form similar to (2) to data sampled through 
link-tracing samples. Unfortunately, these approaches require a sample 
that is amenable to the model in question. That is: 



P(S|y, d, z) = P(S\y obs , d ofes , z obs ), (4) 

where * b. s represents the observed part of *, and also that the sampling 
and model parameters are separable. This is equivalent to the conditions 
for ignorability according to Rubin (1976) and Little and Rubin (2002). Un- 
fortunately, in the case of RDS, condition (4) is violated by the convenience 
sample of seeds, which may well depend on unobserved characteristics. 

Therefore, we require a novel approach to model fitting. As d and z are 
unknown, we construct design-based estimators of them from estimates of 
the sampling probabilities 7Tj. Specifically, let N kt be the number of nodes 
of degree k and infection status I, k e {1, . . . , N — 1}, I e {0, 1} and N = 
{Hw}fc=i^=o i=1 - We estimate N and g(y, z) by, 

N *f ^ = = (5) 

i=l l 

g{y,x) = 2_^ 7p ( 6 ) 

where is the indicator function on *, and iti is assumed constant for all 
% : di — k, Zj = I. Note that this requires the observation of {xj : Sj = 1, i — 
1, . . . , iV}. We then estimate rj as the natural parameter corresponding to 
the mean value parameter g(y,x) with the joint degree and infection sta- 
tus sequence implied by N. Details of this computation are given in the 
Supplemental Materials. 



3.3 Algorithm 

Note that the value of the network working model parameter, required 
to estimate it, in turn, depends on the value of n. We therefore apply an 
approach similar to self-consistency (Lee and Meng, 2007) to find a joint 
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solution to (5) and (6), as well as to the equations: 



E(s^z,d,77(N,s(y,x))) 



i = l,...,N. 



(7) 



This approach iterates between estimating the network working model 
parameter given values for the sampling probabilities, and then estimating 
the sampling probabilities given the network working model parameter. 
Explicitly, it is: 

• Estimate proportional to degree dj. 

• Iterate the following steps: 

- Compute design-based estimates of statistics N w and g(y, x) us- 
ing iifj in (5) and (6). 

- Determine the working ERGM parameter r] corresponding to N 
and (?(y,x). 

- Simulate M networks according to the working ERG model. Es- 
timate Ttfj by simulated RDS sampling from the resulting net- 
works. 

• Use the resulting estimated probabilities, 7r^, to form the weighted 
estimator of the quantity of primary interest: 



The iterative nature of this procedure is similar to that used for the 
successive sampling estimator of Gile (2011). This algorithm differs in the 
core process of estimating the inclusion probabilities. More details of this 
procedure are provided in the supplemental materials. 

The simulation procedure implicit in this estimation algorithm lends 
itself to a realistic bootstrap approach to standard error estimation. We 
present such a bootstrap in the supplemental materials, along with a sim- 
ulation study illustrating its performance under a variety of conditions. 



I^MA — 




(8) 



N Si 

1=1 TTf. 
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4 Comparing the Model-Assisted to Existing Es- 
timators: A Simulation Study 

Gile and Handcock (2010) present an extensive simulation study of RDS 
based, where possible, on a set of realistic characteristics of data from the 
CDC pilot study of RDS (Abdul-Quader et al., 2006). For comparison pur- 
poses, our simulation study uses the same simulated populations as Gile 
(2011), along with extensions necessary for our sensitivity analyses. 

4.1 Study Design 

Our simulation study is designed around three levels of simulation: 

• The generation of random networks according to specified network 
features 

• The generation of simulated RDS samples from each network 

• The estimation of infection prevalence from each set of simulated 
sample data. 

We use variants of the network and sampling parameters to study the be- 
havior of the proposed estimator. Descriptions and levels of these param- 
eters are listed in Tables 1 and 2. 

Table 1: Parameters of simulated networks. Default parameters given in 
boldface. 



Parameter 


Meaning 


Values 


Number of nodes 




1000, 715 


Prevalence 




0.20 


Mean degree 


d=^E£ = iE(*) = |ES=iE(^) 


7 


Homophily 


where N° = N(l — /i), iV 1 = N/j, 


5, 3,1 


Activity 
Ratio 




1,1.8 



To allow for comparability across simulation conditions, throughout 
our simulations, we maintain the same true recoverable prevalence, fi = 
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0.20, the same sample size n = 500, and the same mean degree d = 7. We 
consider variations on the population size (hence the sample fraction), the 
degree of clustering or homophily on infection status, and differential rates 
of tie formation by infection status (or activity ratio). The parameter levels 
considered are summarized in Table 1. 

Under each set of network parameters, networks are simulated accord- 
ing to an ERGM with sufficient statistics: 

N 

i=l j<i 
N 

0»(y) = ££yo-(i-*)(i-3i) ( 9 ) 

i=l j<i 
N N 

i=l j=l 

These three terms correspond to the unique cells of the 2x2 mixing matrix 
on z, and for a given number of nodes N and prevalence /i, are uniquely 
defined by d, R, and w. Note that this model is similar to (2), but not 
identical. While (2) conditions on the fixed degree of each node, this model 
allows for stochastic variability in degrees around mean value parameters 
given by (9). 

From each simulated network, a single RDS sample is drawn accord- 
ing to parameters in Table 2. A fixed number n° of seed nodes are selected 
with probability proportional to degree (the best case for the earlier esti- 
mators), from either the full population or from the infected nodes only (to 
simulate extreme seed bias). The simulated process treats the case of two 
coupons distributed by each respondent completely at random among its 
previously un-sampled alters. Two coupons are chosen for simplicity, and 
because it represents the sampling process better than either 3 (equating 
to the return of all coupons in practice) or 1 (resulting in non-branching 
chains). 

For each simulation case, we simulate 1000 networks with one RDS 
sample from each, and we compare five estimators, as summarized in Ta- 
ble 3. 
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Table 2: Parameters of simulated RDS sampling. Default parameters given 
in boldface. 



Parameter 


Meaning 


Values 


Number of Seeds 


n° = E, S° 


10, 6, 20 


Seed Selection 


Sequentially with probability proportional 
to degree from either: 


full population, 

infected nodes 


Branching 


From each sampled node, up to n cup 
previously unselected alters are selected 
completely at random for subsequent sampling. 
n cup are selected whenever available. 


2 


Sample Size 


Sampling stops when n nodes have been sampled. 


500 



Table 3: Five estimators compared in the simulation study. 



Abbreviation 


Source 


Estimator 


Mean 


Naive sample mean of Zj 


p, 


SH 


Salganik and Heckathorn (2004) 




VH 


Volz and Heckathorn (2008) 


AvH 


SS 


Gile (2011) 


frss 


MA 


Current paper 





4.2 Primary Results 

We begin by studying the performance of the proposed estimator in set- 
tings where previous estimators have been found to perform well. In 
the first part of Figure 1, there is a relatively small sample fraction (50%, 
N = 1000), no homophily on infection status (R = 1), the ratio of mean de- 
grees by infection (w) is 1, and seeds are chosen from the full population, 
so there is no bias induced by seed selection. In this case, none of the es- 
timators considered exhibit bias, and the naive sample mean exhibits the 
lowest variance, although the variability is similar across estimators. 

The second part of Figure 1 illustrates the case fiss is designed to ad- 
dress. In this case, the sample fraction is large (about 70%, iV = 715), 
and infected nodes have mean degree 80% higher than that of uninfected 
(w = 1.8). In this case there is still no homophily (R = 1), and no seed bias. 
Here, the higher-degree infected nodes are over-represented in the sam- 
ple, resulting in positive bias in the sample mean. Because of assumed 
linear mapping from degree to sampling probability, fi SH and /t V H over- 
correct for this feature, resulting in negative bias. fiss and jx M A appropri- 



12 



o 

C\] 



o 

00 

o 



C\J 



■ 


Mean 


□ 


SH 


□ 


VH 


□ 


SS 


□ 


MA 



CD 

-I— ' 
CO 

.1 o 

-t — ' 

C/5 

111 

CD 
o 
c 

_CD 
CO 

> 




o 

OJ 



o o 
9 



Samp % 50% 
w 1 
R 1 
Seeds Random 



70% 
1.8 
1 

Random 



° o 



-B-8- 



50% 
1 
5 

Infected 



70% 
1.8 
5 

Infected 



Figure 1: Comparison of performance of five RDS estimators under four 
conditions. fi ss and £l m a assume correct population size N. Results from 
1000 simulations. 



ately adjust for the over-sampling of infected nodes, resulting in unbiased 
estimators without increased variance. 

The third section of Figure 1 considers the case the new estimator, [ima, 
is designed to address. There is homophily (R = 5), and all seeds are 
selected from among the infected nodes. This case treats a smaller sample 
fraction (N = 1000) and activity ratio 1 (w — 1). Here, all of the earlier 
estimators exhibit bias due to the selection of seeds (note the direction 
of bias is different for JIsh), while the proposed estimator appropriately 
corrects for the selection of seeds. 

The final case considers the joint effects of large sample fraction (N = 
715), non- activity ratio (w = 1.8), homophily (R = 5), and biased se- 
lection of seeds (all infected). Here, the sample mean over-represents 
the higher-degree and initially sampled infected nodes. /2vh exhibits a 
strong negative bias, similar to that in the second case. The two effects 
jointly cause tremendous bias in (x S h- frss is affected by seed bias, al- 
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though not by the sample fraction. Here, again, the proposed estimator 
correctly adjusts for all of these effects. Although in this example the ef- 
fects of sample fraction/ activity ratio are of larger magnitude than those 
of seed bias/homophily, in practice the relative magnitudes of these will 
vary across data sets. 




r 1 3 5 Max Wave 4 5 6 

Seeds 20 10 6 



(a) (b) 

Figure 2: Comparison of the performance of five RDS estimators with bi- 
ased seed selection and various levels of homophily and numbers of sam- 
pling waves. All treat N = 1000, w = 1, and all seeds are selected from 
among infected nodes only. The first subfigure illustrates the exacerbating 
effect of homophily on seed bias. The second illustrates the ameliorating 
effect of increased sampling waves on seed bias. 

Figures 2(a) and 2(b) illustrate additional features important to the im- 
pact of seed selection on the bias of the sample mean and earlier estima- 
tors. In particular, Figure 2(a) illustrates that the bias is exacerbated by 
increased homophily in the underlying population, and Figure 2(b) illus- 
trates that bias is attenuated by having more sampling waves (attained by 
selecting fewer seeds for fixed sample size and branching). In each of 
these cases, the proposed estimator has negligible bias. Note that for very 
high levels of homophily (R = 13), the proposed estimator was found to 
exhibit positive bias, but of much smaller magnitude than that of the other 
estimators. 
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4.3 Sensitivity to the Population Size Estimate 

In practice, the size of the hidden population, N, may not be known. We 
therefore conduct a sensitivity analysis illustrating the performance of the 
proposed estimator in the case of an inaccurate estimate N of N. We con- 
sider cases of N-±(N-ri) = N s < N and N+^(N—n) = N t > N. For each 
treatment of N, we treat the four cases in Figure 1. p, and /t V H correspond 
to the extreme cases of N = n and N = oo, respectively, so are plotted for 
reference alongside p, S s and fa ma for each level of N. 

Figures 3(a) and 3(c) illustrate that in the case of unity activity ratio and 
a smaller sample fraction, the assumed population size has little impact on 
the resulting estimators (note that in the case with N = 715 in Figure 3(c), 
seed bias leads to the perception of a non-unity activity ratio which, along 
with a smaller assumed population size results in the perception of a larger 
sample fraction). 

Figures 3(b) and 3(d) illustrate that in the case of large sample fraction 
and activity ratio (w ^ 1), the assumed sample fraction does impact the 
estimators fa ss and (Ima- When there is no seed bias (Figure 3(b)), these 
estimators perform nearly identically. Gile (2011) argues that in this case, 
fass interpolates between the sample mean and /} V H/ and that trend seems 
to hold for fa MA as well. In the case of seed bias (Figure 3(d)), however, 
fass and fa M A differ, in that fa M A corrects for the bias induced by the seed 
selection. 

Finally, it is worth noting that for smaller sample fractions, such as 
in Figure 3(c), the bias induced by seed selection may be of far greater 
magnitude than the bias induced in /2 V h by finite population effects. For 
this reason, for smaller sample fractions, fauA may be able to correct for the 
more important form of bias, without being greatly affected by uncertainty 
in the population size. 

4.4 Sensitivity to the network working model 

The role of the network working model is to provide a (stochastic) rep- 
resentation of the networked population. This model is the basis of the 
improved representation of the RDS design leveraged by the proposed 
estimator. The complexity of real-world social networks is high, so that 
simple network models will typically only capture a subset of this com- 
plexity. 
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The ERGM in (2) is designed to represent two levels of network struc- 
ture that are important to model RDS. The first is the nodal level repre- 
sentation of the individual heterogeneity in the propensity to have social 
ties. This is via the nodal degrees which are also a measure of the cen- 
trality of the individuals in the network (Freeman, 1979). The second is at 
the dyadic level and captures the homophily, or propensity for ties to be 
between individuals with the same infection status (beyond that implied 
by the infection prevalence). As infection status is the primary outcome 
of interest, this homophily is the most important to capture. The model 
(2) does not capture third level triadic effects, those based on the struc- 
ture of triads of relations between individuals. While these are tertiary to 
the monadic and dyadic effects they can influence the RDS. Unfortunately 
RDS results in branching tree patterns of observations that limit the em- 
pirical information on these triadic effects. Hence the model (2) presumes 
that the triadic effects are those that would be produced by the modeled 
monadic and dyadic components. 

The purpose of this section is to assess the sensitivity of the estimator to 
this misspecification of the triadic effects. Explicitly, we will consider net- 
worked populations with higher levels of transitivity than specified in the 
network working model and compare the performance of the estimators. 
Transitivity is represented by the edgewise shared partner (alter) statistics, 
denoted EP (y), . . . , EP7v-2(y)/ where EP fc (y) is defined as the number of 
unordered pairs i,j such that y^ = 1 and i and j have exactly k common 
alters. It is a measure of the shared friendliness of friends. The geometri- 
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Figure 3: Sensitivity of £i S s and £l M a to inaccurate population size under 

four network and sampling conditions. Gile (2011) argues that fx S s ap- 
proaches the sample mean for small assumed population sizes, and ap- 
proaches /ivH for large assumed population sizes. 

cally weighted edgewise shared partner (GWESP) statistic, conditional on 
the 9 parameter, is 

N-2 

gwesp e (y) = e * £ [l - (1 _ e^Y] EP,(y) 9 > 0. 

i=i 

The GWESP is an aggregate measure of local clustering or the overall "in- 
wardness" of ties. The parameter 9 controls just how "local" the clustering 
needs to be. If 9 = an edge with one shared partner counts the same 
as an edge with two or more shared partners. If 9 > an edge with one 
shared partner counts less than an edge with two or more shared partners. 
So large values of 9 mean that very tight clustering is highly weighted and 
loose clustering is emphasized less. These terms have been developed for 
ERGM by Snijders, Pattison, Robins and Handcock (2006) and Hunter and 
Handcock (2006). 

Most real-world networked populations over which RDS will be ap- 
plied may be expected to have higher levels of transitivity than that pro- 
duced by monadic and dyadic effects. Here we will consider two ways 
to produce networks with higher propensities for "friends of friends to be 
friends". To investigate the relative performance of the estimator in pop- 
ulations with higher transitivity, we generate networks with exactly the 
same monadic and dyadic statistics as those considered in Section 4.2 but 
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Figure 4: Comparison of performance of five RDS estimators when the 
network working model misspecifies the transitivity. The populations in 
the left panel have ten times the GWESP with 9 = and the right panel 
with 6 = 1. Results from 1000 simulations. 



with higher transitivity as measured by the GWESP statistic. We do this by 
adding a gwesp e (y ) term to the model (9) and inflating the mean value pa- 
rameter of the gwesp 6) (y) while holding the mean value parameters of the 
other terms, as well as the degrees of each node, fixed. We then simulate 
networks from the resulting model, and apply the sampling and estima- 
tion procedures. 

To test the correction for seed bias under model misspecification, we 
consider the populations in the third section of Figure 1. These have a 
smaller sample fraction (N = 1000), high homophily (R = 5), no differ- 
ential activity (w = 1) and biased selection of seeds (all infected). We take 
the same 1000 populations and re-generate them with the exactly the same 
degree sequences and homophily but with increased transitivity (as mea- 
sured by the gwesp 6) (y)). 

Figure 4 compares the same estimators as in Figure 1. The left panel 
consider populations with gwesp (y) ten times that in the original. A 
value of 9 = means that the statistic measures the number of pairs of 
people that are connected both by a direct edge and by a two-path through 
another person (that is, the number of edges minus the number of edges 
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connected by no two-paths). As can be seen, the performance of {1m a is 
little effected by the increased transitivity The right panel compares pop- 
ulations with ten times gwesp 1 (y). A value of 9 = 1 means that the statistic 
weighs up the connectedness of edges with more weight on the terms with 
more shared partners. In this case {lma has modest positive bias (0.46%) 
and similar variance compared to the estimators on the original popula- 
tions. 

5 Application to HIV prevalence in Hidden Pop- 
ulations 

We apply our estimator to data collected in 2007 among injecting drug 
users (IDU) in Mykolaiv, Ukraine. The HIV epidemic in the Ukraine is 
one of the most severe in Europe, and still growing. As of 2009, the adult 
HIV prevalence was estimated at 0.86% (Ukrainian AIDS Centre, 2009). 
Ukraine's epidemic is most severe among injecting drug users and their 
sexual partners, who account for the majority of new infections (United 
States Agency for International Development, 2010). The data we con- 
sider here were collected as part of a series of studies of IDU across major 
Ukrainian cities in 2007 (Kruglov, Kobyshcha, Salyuk, Varetska, Shakar- 
ishvili and Saldanha, 2008). We focus on the data collected in Mykolaiv 
because, by chance and because of the contacts available to the researchers, 
all seeds in this sample were HIV positive. 

This study began with 6 seeds and continued until wave 10, with 31 
samples from wave 10, and a total of 260 samples. The average wave 
number was 6.1. The homophily based on HIV status for the population 
is estimated to be R = 2.47 and the differential activity is estimated to be 
0.72. 

Although the size of the population was not known precisely, an es- 
timated range of population sizes is available through scale-up and mul- 
tiplier methods (Kruglov et al., 2008; UNAIDS/WHO, 2003). We chose a 
population size, iV = 4000, near the low end of this range. The variability 
of population size estimates is quite large, with a point estimate closer to 
8000 in 2008 (Berleva, Dumchev, Kobyshcha, Paniotto, Petrenkon, Saliuk 
and I. A. Shvab, 2010)). We used sensitivity analysis to verify that pop- 
ulation size 4000 is sufficiently large that our estimates are insensitive to 
increased population size. 
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Figure 5: HIV prevalence estimates for injecting drug users in Mykolaiv, 
Ukraine, 2007. 



We compare estimates for this application based on the current stan- 
dard estimators (fisn, Avh/ Ass) and three variants of fiMA, summarized in 
Figure 5. First, we consider a version of fiMA which does not correct for 
seed bias. In this case, we select the seeds of the simulated samples with 
probability proportional to degree and without regard to infection status. 
As illustrated in Figure 5, this results in an estimate very close to that given 
by fiss and /tvH (Avh = 0.844, jigs = 0.843, [ima = 0.845). In the second 
condition, we then apply the correction for seed bias, by matching the 
simulated seeds to the infection and degree characteristics of the observed 
seeds. This results in the second estimate in Figure 5, (lma = 0.837. This 
adjustment is in the direction we would expect, decreasing the prevalence 
estimate, corresponding to down-weighting the group over-represented 
in the seeds. The modest magnitude of this adjustment can be attributed 
to the weak homophily in this network, relative to the number of sample 
waves, as well as the high prevalence, leading to a smaller difference be- 
tween prevalence in the population and prevalence among the seeds than 
in our simulation study. 

The third condition we considered highlights the flexibility and possi- 
bilities for extensions of fx ma- We note that in these data, infection groups 
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differed in their recruitment behavior. Some differences in recruitment 
behavior have been referred to as differential recruitment effectiveness in 
Heckathorn (2007), as well as in Tomas and Gile (2010) and Guntuboyina, 
Barbour and Heimer (in preparation). This is a pattern in which one group 
systematically recruits more effectively than another group. In this case, 
however, the pattern was more complex. On average, infected and unin- 
fected participants did not vary greatly in their recruitment effectiveness. 
However, uninfected participants in the early waves of the study recruited 
disproportionately few additional participants, as illustrated in Figure 6. 
Because of the branching nature of the sampling, this resulted in a dra- 
matic under-representation of uninfected IDU in the survey. To correct for 
this, however, we needed to estimate and replicate offspring distributions 
varying by both infection status and survey wave. 



Wave Uninfected Recruiter Avg Infected Recruiter Avg 

10 I 7 1 I 24 1 

9 | 8 12 IWcM 0.93 | 17 HI ■.-■ 0.75 

8 | 4 I ■iM l.25 I 15 ~[2T1M 1.08 

7 I 2 I 1 I ~^1^M 1.71 | 10 l?l Bl 1.1 

5 |1| 2 I 1 1 I 9 Til MM 1.28 

4 | 2 I 2 1 0.5 I 11 1 MP" 95 

3 | 6 | ~^M^M 1.67 

2 I 1 1 I 8 "TTTM 1.07 

o 1 Hi^^^E^H2.33 



Total | 30 I 8 I Beao.89 | 119 l?0l EM 0.99 



Legend: Number of Recruits: □ o □ 1 ■ 2 ■ 3 
Figure 6: Distribution of number of successful recruitments by wave and 
infection status of recruiter. Uninfected recruiters were rare and unsuc- 
cessful in early waves, contributing to under-representation of uninfected 
participants in the sample. There were no uninfected participants in waves 
(seeds) or 3. 

We therefore applied a version of fiMAr modified to reflect the empiri- 
cal offspring distribution by wave and infection status. In most cases, this 
required simply assigning an offspring distribution equal to the empiri- 
cal offspring distribution by wave and infection status. For uninfected re- 
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cruiters in wave 3, we used averaged empirical values from waves 2 and 4. 
For waves 10 and beyond, we replicated the empirical results from wave 9. 
Whenever a simulated recruiter did not have enough eligible alters to al- 
low for the number of recruits selected from the appropriate distribution, 
we assigned any unfulfilled recruitments to the next active recruiters of the 
same infection status with fewer than 3 assigned recruits. This is straight- 
forward to apply in our model-assisted setting and illustrates how this ap- 
proach allows the approximation to the sampling design to be improved 
using available information. 

The results of this analysis are illustrated in the third bar of Figure 5. 
The resulting estimate, 0.817, was substantially lower than the earlier es- 
timates, suggesting the offspring distribution had a substantial impact on 
the resulting estimates. 

6 Discussion 

In this article we introduce a new approach to estimation based on RDS 
data that uses a working model for the underlying networked popula- 
tion to more accurately estimate the inclusion probabilities necessary for 
design-based inference. 

We demonstrate that this approach allows us to correct for differential 
sampling probabilities based on nodal degrees, as in earlier RDS estima- 
tors (Heckathorn, 2007; Salganik and Heckathorn, 2004; Volz and Heckathorn, 
2008), as well as for finite population biases, as addressed by another ear- 
lier approach (Gile, 2011). In addition, our proposed estimator is able to 
adjust for the convenience sample of seeds, a feature not accounted for in 
any previous approaches. 

We apply this approach to obtain improved estimation of HIV preva- 
lence in an IDU population in the Ukraine. We improve the approximation 
to the actual RDS process resulting in improved estimates, and compute 
associated measures of uncertainty. We also show the flexibility of the 
working model approach. It allows for additional information available in 
a particular application to be incorporated via the ERGM framework, and 
leverages recent advances in that area (Handcock, Hunter, Butts, Goodreau 
and Morris, 2008; Snijders et al., 2006). A significant weakness of our ap- 
proach is the requirement that the population size is known. Our sim- 
ulation study illustrates that the proposed estimator is indeed sensitive 
to estimates of population size, but as long as the population size is not 
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greatly under-estimated, we do not expect it to perform worse than the 
earlier estimator /} V H/ and in cases of bias induced by the sample of seeds, 
fiMA may perform substantially better than any existing estimator, even 
for highly inaccurate assumptions regarding the population size. 

We therefore propose three practical regimes for the use of (Ima- First, 
if the population size is known, the estimator may be applied directly 
If the population size is unknown, but a range of estimates is available, 
the estimator may be applied across the range. If the results vary greatly, 
the uncertainty of the resulting estimator should be adjusted accordingly. 
Such may be the case, for example, in populations of men who have sex 
with men, who are assumed to constitute 1% - 3% of many populations. 
Finally, if no information on population size is available, jlss may be com- 
pared with /tvH to determine whether there are important finite popula- 
tion effects in this sample. If not, \xma may then be applied to correct for 
any effects of seed bias. In the case where finite population effects are 
found, fiMA may be applied to diagnose the extent of seed bias at each of 
several estimates of population size. Note that in such cases, the earlier 
estimators also make an assumption about population size (i.e. that it is 
sufficiently large), and so do not avoid the problem of unknown population 
size. 

Another important assumption is the form of the social network work- 
ing model. Our estimator relies on a simple model, not because we believe 
it to be strictly accurate, but because we expect it to capture the network 
features most important to the sampling process, and because it is feasible 
to estimate from the available data. To assess the sensitivity of the esti- 
mator to the form of the working model we considered versions of pop- 
ulations with greatly increased transitivity, a feature not captured by the 
working model. The results indicate only modest impact of high transi- 
tivity on the estimator or it uncertainty. While this may not be universally 
true, it does indicate the ability of the working models to capture nodal 
and dyadic effects goes a long way to improve the representation of the 
RDS process. 

Several extensions of this approach are possible. First, if data on the 
characteristics of all alters are not available, we may wish to estimate the 
sum of cross-group ties (g(y,x)) based on referral patterns. Such an esti- 
mate is used in the application to HIV prevalence estimation (Section 5). 

Our approach can also be extended to include additional measurable 
features of the network working model or sampling process, such as ho- 
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mophily on neighborhood of residence or bias in the passing of coupons. 
We illustrate one such extension in Section 5, in which we observe an aber- 
rant pattern of recruitment by infection status, and adapt the estimator to 
condition on this pattern. Note that the resulting estimate is very close to 
that given by £lsh- This is consistent with results in Tomas and Gile (2010) 
indicating that fisH is not as susceptible to bias induced by differential re- 
cruitment effectiveness as /tyH or jigs- 

Further extensions of this approach will make it possible to consider 
the joint estimation of population size and prevalence, or correlations be- 
tween multiple nodal variables. We explore these features in ongoing 
work. 

We intend to make code available for these procedures in the R package 
RDS on CRAN (Handcock, Gile and Neely, 2009; R Development Core 
Team, 2007). 

Inferential and Computation: This supplement presents specifics of the 
estimation algorithms and our approach to standard error estimation 
(RDSMAsupplement.pdf) 
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1 Estimation Procedure 

We propose the following algorithm to compute the new estimator jl M A of 



1. Estimate the following according to their empirically observed val- 
ues: 

• Sample size n 

• Number of seeds n seeds , and degree and infection status of seeds, 
given by W eeds = {Nfj eds }, where Nfj eds represents the number of 
seeds with degree % and infection j, % e 1 . . . N — 1, j e {0, 1}. 

• Offspring distributions p s , where pf = the proportion of the 
sample with % offspring, i — 0,1, ... , maximum number of coupons. 

2. Estimate: 




3. 



For r = 1 . . . h: 
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(b) Compute the ERGM parameter 77 in the model (2) based on 
N r and g(y,x) r via the procedure in Supplemental Section 3. 
Denote the estimate by rf . This step is conducted using the 
statnet R package (Handcock, Hunter, Butts, Goodreau and 
Morris, 2003). 

(c) Simulate M x networks according to the distribution given by 
ff, W, and g(y, x) r , also using the statnet R package. 

(d) Simulate M 2 RDS samples from each of the Mi networks in the 
previous step, according to sampling parameter y = {n, pj seeds }, p s 
Let U r kl represent the number of times a node of degree k and in- 
fection I is sampled, over all M = Mi x M 2 samples. 

(e) Estimate tt\ V % : Sj = 1 in a manner similar to Fattorini (2006) 
and Gile (2011): 

V U ^ + 1 



M ■ + 1 



4. Let ifi = irf 

5. Estimate 



Z_/i=l i ri /c 1 \ 



The simulations in this paper are based on r = 3 iterations, each including 
Mi = 25 network samples and M 2 = 20 RDS samples from each network. 
In general, we recommend at least M x = 25, M 2 = 20 and r = 3. Estimation 
time scales with sample size, population size, and M. In our simulations, 
with iV = 1000, estimates require about 20 minutes on a personal com- 
puter. In practice, these parameters can be adjusted for desired precision 
in the solution to (7). We will make available the code to compute this es- 
timator in the RDS R package on CRAN (Handcock, Gile and Neely, 2009; 
R Development Core Team, 2007). 
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2 Measures of Uncertainty: Bootstrap 



Unlike earlier RDS estimators, the estimator given in (7) allows for estima- 
tors of uncertainty that account for the estimated full relational structure 
of the underlying population as well as incorporating several observable 
features of the sampling process. The former is because of the use of the 
network working model for the population over which the RDS sampling 
procedure operates. The latter is because our procedure enables the sim- 
ulation of complex RDS designs. In particular, if we believe there is seed 
bias we can incorporate it into the sampler, and if there is measurable sam- 
pling bias (as in the application) we can incorporate that also. This allows 
the procedure to incorporate available information about the population 
and sampling and greatly improves the accuracy of the representation of 
the actual sampling process. The accuracy of the bootstrap depends di- 
rectly on the quality of the approximation to the actual sampling process. 

We propose a parametric bootstrap approach to obtaining confidence 
intervals, according to the following procedure: 

1. For b = 1,...B, iterate the following steps: 

(a) Simulate a network Y b from the model given in (2) with pa- 
rameters i] = fj h , and N h where h is the final iteration of the 
algorithm in Supplemental Section 1. 

(b) Simulate one RDS sample Sy 5 with parameter y h from Y 6 . 

(c) Compute an estimator /2ma(&) of \x based on the sample Sy b us- 
ing the algorithm in Supplemental Section 1. 

2. Use the empirical distribution S>{Jima) of Ama(2), . . . plma(B)} 
to estimate the distribution of fi M A under the estimated model form. 

The distribution of &(P,ma) m ay then be used to form confidence inter- 
vals for /i which account for the full estimated relational structure as well 
as observable biases of the sampling process. We use the standard devia- 
tion of the resulting population of B bootstrap estimates as an estimate of 
the standard error of $l ma . We have used B = 1000 bootstrapped samples. 
In our simulations, this procedure took about 20 minutes per sample on 
a single processor. Parallelization is straightforward and dramatically re- 
duces elapsed time. A large additional speedup can be obtained by replac- 
ing step (c) with (c') in which the weight classes 7if from (S.l) are reused to 
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weight each bootstrap sample. While these weights vary from bootstrap 
sample to sample, their uncertainty is a small part of the overall uncer- 
tainty This reduces the procedure to about 30 seconds per sample on a 
single processor. The analysis below uses (c')- 

We illustrate the performance of this standard error estimator by com- 
paring five critical cases. As with the point estimate, we illustrate both 
cases in which we expect the estimator to perform reasonably well, and 
a case in which we expect the estimator to perform poorly. We introduce 
various forms of sampling biases. The initial sample can be selected ei- 
ther independent of infection status (denoted "No" in the bias column of 
Table 1) or all from within the infected subgroup ("Initial" bias). We also 
introduce referral bias where all infected alters are 20% more likely to be 
referred than uninfected alters ("Referral" bias). 

Each set of simulations involved 1000 bootstrapped re-samples for each 
of 1000 simulated RDS samples. The parameters of the samples, average 
estimated standard errors, and coverage rates of nominal 95% and 90% 
confidence intervals are given in Table 1. 

Table 1: Observed (simulation) standard errors of estimates, and average 
bootstrap standard error estimates, along with coverage rates of nominal 
95% and 90% confidence intervals for procedure given in Supplemental 
Section 2 for varying sample proportion, homophily R, and activity ratio 
w, and for various biases in the sample selection process. Observed stan- 
dard errors are based on 1000 samples. Bootstrap standard errors are the 
average bootstrap standard error estimates over the same 1000 samples. 
Nominal confidence intervals are based on quantiles of the Gaussian dis- 
tribution. 
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SE 


SE 


coverage 


coverage 


sample 


R 


w 


bias 


observed 


bootstrap 


95% 


90% 


50% 


1 


1 


No 


0.0140 


0.0137 


94.1% 


88.8% 


70% 


1 


1.8 


No 


0.0073 


0.0075 


94.9% 


90.4% 


50% 


5 


1 


Initial 


0.0188 


0.0175 


93.7% 


87.9% 


50% 


5 


1.8 


Initial 


0.0079 


0.0080 


95.0% 


87.3% 


50% 


5 


1 


Referral 


0.0216 


0.0225 


91.7% 


84.7% 
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The magnitudes of the average bootstrap standard error estimates are 
quite close to the observed values in the first four cases, and the cover- 
age rates in the cases without referral bias are very close to their nominal 
values. In this last case, the standard error estimator is anti-conservative 
because the bootstrap procedure does not replicate the referral bias in the 
sample. 

The last row of Table 1 illustrates the poor performance of the estimator 
in the case of extreme referral bias. In this case, the estimator (ima has 
positive bias (1.74%), leading to moderately lower coverage rates of the 
nominal intervals. 

3 Inference for the ERGM conditional on the de- 
gree and infection status sequences 

The model-assisted approach is based on a "working" model (2) for the 
networked population. The unknowns in the model are the finite-population 
values d and z and the super-population parameter r\. Finite-population 
estimates of N (i.e., d and z) and g(y, x) are determined by design-based 
inference as in (5) and (6). The estimate of r\ is computed as the natural 
parameter in (2) corresponding to these values. That is, the natural pa- 
rameter corresponding to the mean-value parameter <?(y,x) conditional 
on the degree sequence d and infection status sequence z induced by N. 
Explicitly, we construct the joint degree and infection status sequence d, z 
from N, where the ordering of nodes is arbitrarily assigned (w.l.o.g.). To 
compute i] we construct a network with this joint degree and degree status 
sequence and cross-group contacts g(y,z) using the Reed-Moll oy method 
and then simulated annealing (Handcock et al., 2003; Handcock, Hunter, 
Butts, Goodreau and Morris, 2008; Molloy and Reed, 1995). 

We can then compute fj using the Geyer-Thompson MCMC approach 
(Handcock et al., 2003, 2008). As this is computationally expensive and 
unstable in this situation we use an approach based on a form of pseudo- 
likelihood introduced below. 

Consider a model similar to (2) but with network space & consisting 
of all binary undirected networks (i.e., unconditional on d and z). Un- 
til recently inference for such models have been almost exclusively based 
on a local alternative to the likelihood function referred to as the pseudo- 
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likelihood (Besag, 1975; Strauss and Ikeda, 1990). Consider the conditional 
formulation of this model: 

logit[P(y ij = 1\Y£ = y£, 77)] = itf (y£) yGf (S.2) 

where £(y?-,z) = ^(yj, z)-#(y~., z), the change in g(y, z) wheny^ changes 
from to 1 while the remainder of the network remains y?- (See Strauss 
and Ikeda, 1990). The pseudo-likelihood for the model is: 

e P ( m y) = V J2 S(y z)y y - £ log [l + exp(itf(y&, z))] . (S.3) 

This is the standard form of pseudo-likelihood, which we refer to a dyadic 
pseudo-likelihood. 

This form is algebraically identical to the likelihood for a logistic re- 
gression model where each unique element of the adjacency matrix, y^, 
is treated as an independent observation with the corresponding row of 
the design matrix given by S(y^, z). Then the maximum likelihood esti- 
mate (MLE) for this logistic regression model is identical to the maximum 
dyadic pseudo-likelihood (MPLE) for the corresponding ERG model, a 
fact that is exploited in computation. Therefore, algorithms to compute 
the MPLE for ERGMs are typically deterministic while the algorithms to 
compute their MLEs are typically stochastic. In addition, algorithms to 
compute the MLE can be unstable if the model is near degenerate (Hand- 
cock, 2003). This can lead to computational failure. 

This standard form of pseudo-likelihood is inappropriate for the model 
(2) as it does not take into account the network space ^(z, d). This is be- 
cause P(Yij = 1|Y£- = yfj , 77) is either 1 or depending on if the value of 
Yij because the model conditions on the degree sequence consistent with 
d. Hence the MPLE will usually produce non-sensical results. 

Instead of a dyadic pseudo-likelihood we develop a tetradic pseudo- 
likelihood. We focus on ordered dyad-quads y ijM = (y^, y kh y u , y jk ) such 
that y {j = y kl = l,y u = y jk = 0. We refer to this configuration as yj w . For 
each such dyad-quad there exists an alternative realization in which y^ = 
yn = 0, yu = y jk = 1. We refer to this configuration as y~ jkl . Thus y± kl and 
y~ jkl represent a pair in which y^ is toggled from 1 to in such a way as to 
retain the degree and infection status sequences of the corresponding full 
network. 
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Let Y ijkl = (Y ij, Y kh Y ih Y jk ), Y c ijH = Y\Y ijkl and y c ijkl = y\y ljki . Let 
S = {ijkl : y-^Uy^j e ^(z, d)}. For these dyad-quad configurations we 
then have: 

Iogit[P(Yy„ = yj„| Y?. H = y£„, 77)] = ^(y^-„, *) y W G ^ (S.4) 

where 5(y c ijkl , z) = ^(y?. fcJ U y± w z) - ^(y?. fc , U y~ jW z), the change in g(y, z) 
when y ijfc ; changes from y~ jkl to y^ fc/ . The tetradic pseudo-likelihood for 
model (2) can then be defined as: 

£ PT (v,y) = vYl s (yijki> z )Kyw = ySw)- lo s i 1 + ex p(^(y^>z))] • 

(S.5) 

As \S\ is large, we take a large random sample of them (N = 100000) and 
use the sample mean to approximate (S.5). This procedure is implemented 
in the statnet R package (Handcock et al v 2003). 

While the MPLE is know to be inferior to the MLE for dyadic depen- 
dence models (van Duijn, Handcock and Gile, 2009) it is equivalent to the 
MLE for some dyadic independence models. For the model (2) the net- 
work statistic is weakly dependent on the set of networks with the given 
degree and infection sequences. Hence the maximum tetradic pseudo- 
likelihood (MTPLE) might be expected to perform well for this model. 
This does seem to be the case for the models considered in this paper. In 
simulations (not shown here) as it appears to be indistinguishable from the 
MLE (where the later is computed by a computationally expensive MCMC 
procedure). The advantages of the tetradic MPLE are that it is computa- 
tionally stable and fast while being numerically indistinguishable from the 
MCMC-MLE. For these reasons we use it in all simulations in this paper. 

This estimator could be improved by adding hexadic configurations 
to the pseudo-likelihood. These are necessary for sampling algorithms to 
cover the full network space (Rao, Jana and Bandyopadhyay, 1996). How- 
ever they also lead to more complex algorithms and will be considered in 
other work. 
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